YouTube has led content creation platforms for nearly two decades, witnessing a surge in the variety of uploaded content. The platform pushes creators to showcase their talent and optimize their content to increase views and maximize reach. This research aims to determine the most crucial factors in boosting the number of views, with a primary focus on the elements presented to users: thumbnails and titles (refer Fig. 1). While video content, watch time, likes, shares, and subscribers play significant roles in YouTube's video recommendation system, this study examines whether titles and thumbnails have a noteworthy impact on converting a video recommendation into an actual view.
The author's previous research establishes that titles indeed play a role in influencing views; that study observed that shorter titles gained more traction than longer ones. The current study builds upon this work by extending the analysis of titles' influence on views and incorporating the characteristics of thumbnails. Along the way, it also explores other performance metrics, such as likes, subscribers, and the number of comments on a video, and their relation to titles and views.
YouTube Analytics Application is a platform (Fig. 2) created by Google for YouTube content creators to analyze their data. It displays basic analytics such as the number of views, shares, likes, dislikes, subscribers, and click-through rate. Additionally, it provides content creators with the freedom to visualize their data to gain more insights into the relationship between variables. The platform offers in-depth analysis of video performance, aiding content creators in improving video quality and content. Despite these features, there is currently no tool or analytics within the platform specifically designed to help creators craft effective titles and thumbnails – a critical aspect of content creation.
Several studies have examined the impact of YouTube thumbnails and titles on views and video performance. Cheng et al. (2015) demonstrated that both titles and thumbnails significantly influence video views, with thumbnails being slightly more influential; their work relied on statistical analysis of YouTube metadata rather than machine learning. Koh et al. (2018) emphasized the importance of visually appealing thumbnails in enhancing video performance over time, observing that videos with appealing thumbnails performed better; their study likewise took a statistical approach. Dong et al. (2024) conducted similar research on the impact of visually appealing thumbnails on views, finding a significant correlation between thumbnail design and video metrics and notably emphasizing the positive effect of a distant background, a focus on one person, and concise headings. Like the other papers, this study approached the effect from a statistical standpoint.
The focus of this research is to develop a tool that assists content creators in evaluating the quality of their titles and thumbnails. Additionally, the research aims to understand the impact that titles and thumbnails can have on users' mindsets, influencing their decision to either click or ignore the video. The fundamental question regarding the influence of titles and thumbnails on views leads to several intriguing research inquiries, outlined below. The project aims to shed light on the types of sentences and images that attract human attention, thereby affecting the click-through rate. This understanding could prove valuable across various domains, such as advertising, enabling brands to comprehend the impact of the text and images they employ in their advertisements and enhance them based on the findings of this research.
Such a tool could aid YouTube content creators in tailoring titles and thumbnails for any category of video by analyzing their effectiveness in gaining traction and, ultimately, popularity on the platform. Though the tool does not guarantee success on YouTube, it helps ensure that the aspects visible to the audience are tailored to their needs, potentially increasing views. Research into the impact of titles and thumbnails on views matters not only for building a tool that boosts creators' channel views but also for understanding the psychological impact of text and images on human behavior and their influence on video click-through rates. Understanding these psychological aspects could lead to stronger advertising and marketing models.
Through this approach, the model seeks to gain insights into the human mindset, specifically addressing the question: 'What prompts a user to click on a YouTube video once recommended?' This breaks down into 10 research questions:
The research questions stated above should give a deeper understanding of the underlying reasons behind viewer engagement and the intricate dynamics influencing the success of YouTube videos. The data used to analyze and answer these questions were gathered from various sources. The following section examines the data collection process.
The major chunk of data utilized in this project was taken from Kaggle, uploaded by VISHWANATH SESHAGIRI [source].
The YouTube API was utilized to collect video titles given the video ID. The title is one of the most essential components of the data.
The YouTube API was also utilized to collect video thumbnail URLs given the video ID. The thumbnail is another essential component of the data [Source code].
The API is used to download the images from the thumbnail URLs, and the downloaded thumbnails are stored in a Google Drive link [Final uncleaned data]. The total time taken to collect 100k titles and thumbnails was 7 days. The API data collection will continue until at least 300k data points have been fetched.
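As a rough sketch of this collection step (assuming a valid YouTube Data API v3 key; the helper names and error handling here are illustrative, not the exact script used):

```python
# Minimal sketch: fetch a video's title and thumbnail URL via the
# YouTube Data API v3, then download the thumbnail image.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder for a real API key
API_URL = "https://www.googleapis.com/youtube/v3/videos"

def fetch_title_and_thumbnail(video_id: str):
    """Return (title, thumbnail_url) for a video ID, or (None, None) on failure."""
    params = {"part": "snippet", "id": video_id, "key": API_KEY}
    items = requests.get(API_URL, params=params, timeout=10).json().get("items", [])
    if not items:
        return None, None
    snippet = items[0]["snippet"]
    # 'high' resolution is usually available; fall back to 'default'
    thumbs = snippet["thumbnails"]
    url = thumbs.get("high", thumbs["default"])["url"]
    return snippet["title"], url

def download_thumbnail(url: str, path: str):
    """Save the thumbnail image bytes to disk."""
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=10).content)

title, thumb_url = fetch_title_and_thumbnail("dQw4w9WgXcQ")
if thumb_url:
    download_thumbnail(thumb_url, "thumb.jpg")
```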
This section walks through understanding the underlying data and cleaning it for analysis and modeling.
To initiate a comprehensive understanding of the data, a word cloud (refer to Fig. 3) has been generated to provide an overview of the columns present in the dataset. It should be noted that the word cloud emphasizes words with higher repetition. It is evident that several 'Unnamed' columns may have been generated while storing data using an API key; these should be removed. Additionally, the 'index' column serves no purpose and can be excluded. Other columns should be retained for further analysis. The next step is to analyze the amount of missing values in the most important columns of the data.
Fig. 4 Proportion of missing values in titles column
After some cleaning, it is observed that none of the columns have missing values. To proceed with data preparation, a deeper examination of the columns is required. The following section will focus on the most important columns and aim to clean them, completing the data cleaning process.
The video view count (refer Fig. 6) appears to follow an exponential distribution. To understand the distribution better, a histogram for views under 8k is plotted. It should be noted that there is a data point that falls below 0, which is not plausible as views cannot be negative. This issue will be resolved by removing the single row containing this anomaly.
The histogram for views under 8k (refer Fig. 7) reveals that the majority of records fall within the bracket of 0 to 4,000 views. To build a multimodal classification model, a more balanced distribution of data across the view brackets is essential. This will be addressed in the upcoming sections of the research, where additional data will be collected to ensure a more even distribution.
Fig. 7 Histogram for video view count (zoomed in)
The box plot (Refer Fig. 8), although not providing detailed information about quantiles, primarily focuses on identifying outliers. It is observed that the dataset contains numerous outliers, with some particularly significant ones having a view count exceeding 50M. Notably, there are very few videos with more than 50M views. The potential impact of these outliers on the analysis will be assessed after collecting more data to determine if this pattern persists.
One of the more intriguing columns to observe is 'views/elapsedtime.' This column interprets views not merely as raw data but as a variable that changes over time. It aids in understanding whether a video gained views steadily over the long run or experienced immediate effects, and vice versa.
Fig. 10 Distribution of the views/elapsed time (zoomed in)
It is interesting to note that there are numerous videos with views exceeding the time elapsed by a factor of 100 to 5000 (Fig. 10). This is a notable observation, suggesting that these videos gained traction very quickly, possibly due to a higher number of subscribers on the channel. The audience's immediate interest in upcoming videos could explain the rapid viewership. However, as the focus here is on assessing the impact of titles and thumbnails on overall viewership, such cases may not accurately represent the relationship. Consequently, these rows will be removed from the dataset.
The data interestingly spans the majority of YouTube categories (Fig. 11 and Fig. 12), exhibiting an almost equal distribution among key genres such as entertainment, people and blogs, gaming, and sports. It should also be noted that most genres have more than 2,000 videos in the 100k-point dataset. This diversity is beneficial, as modeling the data can provide insights into the impact of titles and thumbnails not only within a single category but across a wide spectrum of categories.
At this point, the data is more or less clean, exhibiting no missing values and featuring appropriate data types with consistent values. While the dataset contains some outliers, they are retained at this stage as they might prove useful for the subsequent analysis. A partial snapshot of the cleaned dataset is displayed below (refer Fig. 13). The image illustrates an additional column that has been introduced (VideoCategory), along with cleaned titles and thumbnail URL columns.
The two plots (Fig. 4 and Fig. 5) depict the number of missing values for the columns 'titles' and 'Thumbnail_URL'. It is observed that after the API data collection for 100k data points, almost 14% of the data had missing values. Removing rows with missing values is essential, as rows lacking the most important features for the research are of little practical use. The number of videos can be increased by collecting more data through the API, a step taken in later sections of this research.
Fig. 5 Proportion of missing values in thumbnail URL column
Fig. 6 Histogram for video view count
Fig. 8 Box plot of video view count
Fig. 9 Distribution of the views/elapsed time
From Fig. 9, it must be noted that most of the values lie in the range of 0 to 1, meaning that most of the videos in the data have fewer views than the time elapsed since posting. This is expected, as the majority of videos lie in the 0 to 4k view bracket, and the elapsed time would mostly be greater than the number of views a video gained. The interesting points, however, are the ones above 100.
Fig. 11 Videos per video categoryID
Fig. 12 Videos per video category
From prior research, the relationship between title length and views suggests that title length does play a role in influencing the number of views a video gains. Though the relation is not directly causal, it can be said that title length influences views. To test this hypothesis, k-means clustering is performed on the data. K-means clustering is a form of unsupervised learning that derives clusters of mutually related data points using a distance metric; here, the Euclidean distance is used. The Euclidean distance measures the straight-line distance between data points and thus helps in finding the closest points; the formula is provided in Fig. 14. K-means is a form of partition clustering, an approach that divides the dataset into subgroups by assigning each point to the cluster whose centroid is nearest.
To advance the analysis, titles are embedded using a VGG16 network (see Fig. 15) to extract key features from the text. The VGG16 network, commonly utilized for image recognition, is repurposed here for text analysis. The title embeddings are then subjected to hierarchical clustering to identify the closest titles. Notably, the 'views' variable is not utilized in this method. Hierarchical clustering offers an advantage over k-means, as it does not require a fixed number of clusters to be specified, allowing the natural structure of the data to emerge. The approach employs divisive clustering, in which all data points initially form a single cluster that is subsequently divided into smaller clusters. The distance metric utilized for this method is cosine similarity, chosen for its effectiveness in measuring the similarity between embedding vectors (see formula in Fig. 16).
To determine the influence of title length on views, the 'title_length' feature is derived by counting the number of characters in the title, while the 'views/elapsedtime' ratio is used instead of raw views to focus on the impact of titles relative to the time since publication. The 'views/elapsedtime' column is analyzed for outliers, which are subsequently removed to ensure accurate clustering. Thus, for k-means clustering, a subset of the dataset containing the 'title_length' and 'views/elapsedtime' features is used (refer Fig. 17). Title embeddings are obtained by passing the titles through a pre-trained VGG16 network with its fully connected layer removed, and the resulting features are stored in the 'title_embedding' column. These title embeddings, along with the 'views/elapsedtime' feature, are used for hierarchical clustering. Although 'views/elapsedtime' is not fed into the clustering algorithm, it is utilized post-clustering as labels to analyze the pattern that emerges from the data. The link to the sample data can be found here.
Fig. 17 Clustering Data
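Before the results, a rough sketch of the two clustering setups (assuming a DataFrame `df` with the columns described above and a `title_embeddings` matrix; scikit-learn offers agglomerative rather than divisive hierarchical clustering, so this approximates the described approach):

```python
# Minimal sketch of both clustering setups on the prepared subset.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

# --- K-means on title length vs. views/elapsed time (Euclidean distance) ---
X = df[["title_length", "views/elapsedtime"]].to_numpy()
X_scaled = StandardScaler().fit_transform(X)  # put both features on one scale
df["kmeans_cluster"] = KMeans(n_clusters=2, n_init=10,
                              random_state=42).fit_predict(X_scaled)

# --- Hierarchical clustering on title embeddings (cosine distance) ---
# Requires scikit-learn >= 1.2 (older versions use affinity= instead of metric=).
hier = AgglomerativeClustering(n_clusters=3, metric="cosine", linkage="average")
df["hier_cluster"] = hier.fit_predict(title_embeddings)

# 'views/elapsedtime' is not fed to the clusterer; it is only used afterwards
# as a label to interpret the clusters that emerge.
print(df.groupby("hier_cluster")["views/elapsedtime"].mean())
```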
Three clusters were obtained while clustering the title embeddings. The first cluster primarily included titles resulting in low 'views/elapsedtime' ratios. The second cluster comprised titles with predominantly high 'views/elapsedtime' ratios. The third cluster yielded mixed results, with a combination of low and high 'views/elapsedtime' ratios.
Comparing hierarchical and k-means clustering, both methods effectively yield two clusters (though hierarchical clustering produced three, the third can be ignored as it does not capture video performance based on views), and video titles fall into one of these clusters based on their characteristics.
Based on the results, it is observed that titles with more than 50 characters had a mean 'views/elapsedtime' of 0.5 and a mean title length of 62 characters. In contrast, titles with fewer than 50 characters displayed a mean 'views/elapsedtime' of 0.4 and a mean title length of 30 characters. This suggests that videos with shorter titles performed relatively poorly compared to videos with longer titles. It is worth noting that the boundary lies at 50 characters, and given that the maximum length of a YouTube title is 100 characters, the clusters clearly differentiate shorter titles from longer titles.
The hierarchical clustering revealed three clusters based on title embeddings, each exhibiting varying views-to-elapsed-time ratios. The first cluster displayed a lower mean views-to-elapsed-time ratio, while the second cluster exhibited a higher mean ratio. The third cluster presented a mixture of views-to-elapsed-time ratios. This suggests that titles within the second cluster corresponded to videos with better performance compared to those in the other two clusters.
In conclusion, the results of the cluster experiment indicate that titles indeed have an impact on YouTube video performance.
The conclusions drawn from clustering experiments on video titles and their impact on views suggest that titles indeed play a significant role in video performance. To deepen the analysis, association rule mining (ARM) is employed to study the relationship between title characteristics and various metrics of YouTube video performance. ARM is a method of discovering rules within data to unveil underlying relationships. These rules typically follow an 'if-then' structure, with variables on the left-hand side (LHS) as antecedents and those on the right-hand side (RHS) as consequents (refer Fig. 23).
In ARM analysis, key metrics such as support, confidence, and lift are extremely important. Support indicates the proportion of the dataset containing both the LHS and RHS of a rule, while confidence represents the conditional probability of the consequent given the antecedent. Lift measures the degree of association between the antecedent and consequent, with values greater than 1 indicating a stronger relationship than would be expected by chance.
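For reference, these metrics take the following standard forms for a rule $A \Rightarrow B$ over $N$ transactions, consistent with the description above:

```latex
\mathrm{support}(A \Rightarrow B) = \frac{\lvert \{\, t : A \cup B \subseteq t \,\} \rvert}{N}, \qquad
\mathrm{confidence}(A \Rightarrow B) = \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)}, \qquad
\mathrm{lift}(A \Rightarrow B) = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}
```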
The Apriori algorithm is used for association rule learning. It is specifically designed to discover frequent item sets in transactional databases and to generate association rules based on these frequent item sets.
This study employs the Apriori algorithm to investigate the relationship between title length and various metrics of video performance, including views, subscribers gained, comments received, and likes garnered. By analyzing these associations, the research aims to uncover insights into how title characteristics influence video performance outcomes.
To prepare the data for analysis using the Apriori algorithm, it first needs to be converted into transaction data. The original data, which is in the form of Record data (refer Fig. 25), undergoes a transformation process to create transactional records suitable for association rule mining.
The transformation begins by binning the columns into categories based on an analysis of the data distribution. Each column is processed individually, with the goal of maximizing variability while maintaining equal distribution within each category. For instance, the 'title length' column is divided into 'short title' and 'long title', while the 'views/elapsed time' ratio is categorized as 'performed poorly', 'performed well', and 'performed great'. Other columns are binned as 'relatively more' or 'relatively less' compared to the median of the data. This binning process ensures that each transaction captures relevant information while reducing complexity (refer Fig. 26 for the final sample dataset).
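A hedged sketch of this binning step (the DataFrame `df`, the metric column names, and the cut points below are illustrative, not the exact values used):

```python
# Turn the record data into one transaction (basket of items) per video.
import pandas as pd

tx = pd.DataFrame()
tx["title"] = (df["title_length"] > 50).map(
    {True: "long title", False: "short title"})
tx["performance"] = pd.cut(
    df["views/elapsedtime"],
    bins=[float("-inf"), 0.4, 1.0, float("inf")],  # illustrative cut points
    labels=["performed poorly", "performed well", "performed great"])

# Remaining metrics: compared against the column median
for col in ["subscribers", "likes", "comments"]:   # assumed column names
    above = df[col] > df[col].median()
    tx[col] = above.map({True: f"relatively more {col}",
                         False: f"relatively less {col}"})

# Each row of tx is now one transaction: a basket of categorical items
transactions = tx.astype(str).values.tolist()
```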
In total, approximately 125,000 transactions are generated from the original data. A sample of the final transaction data is provided here.
Fig.25 Sample of Original Cleaned Data
Fig.26 ARM data
The Apriori algorithm was run on the dataset to obtain the top 15 rules by support, confidence, and lift, with a fixed RHS of 'short titles', a minimum support of 10%, and a minimum confidence of 50%. 'Short titles' was fixed as the RHS because it helps explore all the patterns that exist around short titles and their impact on video performance. The rules obtained are shown in the figures below.
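As one possible implementation of this mining step (the original analysis may have used a different library), a minimal sketch with mlxtend, reusing the `transactions` list from the preparation step:

```python
# One-hot encode the transactions, mine frequent itemsets, and rank rules.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(onehot, min_support=0.10, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.50)

# Fix the RHS to 'short title' and take the top 15 rules per metric
short = rules[rules["consequents"] == frozenset({"short title"})]
top_support    = short.nlargest(15, "support")
top_confidence = short.nlargest(15, "confidence")
top_lift       = short.nlargest(15, "lift")
```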
Fig.27 Support Data
Fig.28 Support rules Visualization
The top rules based on the support metric indicate that videos with shorter titles tend to perform poorly in terms of the number of subscribers, likes, and views they accumulate over time. This observation, with a support of 35%, is significant within the dataset, suggesting a substantial association between shorter titles and lower performance metrics.
Fig.29 Confidence Data
Fig.30 Confidence rules Visualization
The top 15 rules ranked by confidence echo the findings from support: with an average confidence of 75%, it can be said that when a video performs poorly, roughly 75% of the time it has a shorter title.
Fig.31 Lift Data
Fig.32 Lift rules Visualization
The higher lift value for the rules also suggests a very strong relationship between the antecedent and the consequent, beyond what could be expected by chance.
From the rules derived through the Apriori algorithm, a strong association emerges between poor video performance and short titles. Notably, an average support score of 30%, confidence level of 70%, and lift value exceeding 1 were observed for rules where poor performance metrics served as the antecedent and shorter titles as the consequent.
The high support, confidence, and lift values confirm that titles indeed impact video views. While the precise magnitude of this impact remains unquantified, the evident relationship between titles and video performance, as revealed by both association rule mining and clustering analyses, cannot be overlooked.
The main principle behind the Multinomial Naive Bayes algorithm is Bayes' theorem (refer Fig. 33), which lets the classifier determine the probability that a given set of features belongs to each class. Multinomial Naive Bayes is used in scenarios where features are discrete counts, such as word frequencies in text. During training, the model learns the probability of each feature occurring in a given class, developing a probabilistic understanding of the dataset. The probabilities calculated during training are then used during testing, enabling the model to estimate the likelihood of an unseen set of feature values and select the class with the highest probability. The Naive Bayes family as a whole assumes independence between features, which is why it earned the name 'naive' Bayes. In this project, the Naive Bayes model is used to predict performance categories such as 'Performed Poorly' or 'Performed Well' from YouTube titles.
Bernoulli Naive Bayes is another variant of the Naive Bayes algorithm specifically designed to handle binary data. It measures the presence or absence of a particular attribute in a dataset. Bernoulli Naive Bayes also makes the assumption that features are independent of one another. This means that the presence or absence of one feature does not affect the presence or absence of another feature. During training, Bernoulli Naive Bayes calculates the probability of each feature occurring or not occurring in each class. It does this by counting the number of occurrences of each feature in each class and dividing by the total number of documents in that class (refer Fig. 34).
Smoothing is a process aimed at preventing occurrences of zero probabilities when testing datasets. During testing, there may be instances where a particular feature has no occurrence in the training dataset, while other features may occur simultaneously. This situation would erroneously result in a zero probability for the entire test dataset, despite the chances of occurrence not being truly zero. To address this issue, various smoothing techniques such as Laplace smoothing (add-one smoothing), Lidstone smoothing, or Dirichlet smoothing are employed. By utilizing smoothing methods like Laplace smoothing, we ensure that neither the numerator nor the denominator in probability calculations becomes zero. Fig. 35 provides the formula for the most commonly used Laplace smoothing method.
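In standard notation, the Laplace-smoothed estimate takes the following form, where $\alpha = 1$ gives add-one smoothing and $\lvert V \rvert$ is the vocabulary size:

```latex
P(w \mid c) = \frac{\mathrm{count}(w, c) + \alpha}{\sum_{w' \in V} \mathrm{count}(w', c) + \alpha \lvert V \rvert}
```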
The data preparation stage for Naive Bayes was conducted in three steps:
Fig.39 Test data sample
The code for the Naive Bayes classifier, implemented in Python, can be found here.
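For an inline view, a minimal sketch of the pipeline (the bag-of-words vectorization and the split ratio are assumptions; alpha = 0.2 matches the setting reported below):

```python
# Minimal sketch: title text -> word counts -> Multinomial Naive Bayes,
# assuming lists `titles` and `labels` ('Performed Poorly'/'Performed Well').
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

X = CountVectorizer().fit_transform(titles)      # bag-of-words title counts
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# alpha=0.2 applies Laplace/Lidstone smoothing, as described above
model = MultinomialNB(alpha=0.2).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```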
Fig.40 Model Evaluation Metrics (per label)
Fig.41 Model Evaluation Metrics (averages)
Fig.36 Sample of Original Data
Fig.37 Training data sample
Fig.38 Training data sample
The Multinomial Naive Bayes model was trained on the training dataset with the alpha parameter set to 0.2 for implementing Laplace smoothing. The results were obtained by running the model on the test dataset, as shown in Figure 40. It is evident from the results that both the accuracy and precision of the model were approximately 58% for both labels. However, the recall varied slightly, with the 'Performed Poorly' label scoring 53% and the 'Performed Well' label scoring 64%. Additionally, the average accuracy, precision, recall, and F1 score for the model were approximately 58%, as depicted in Figure 41.
Furthermore, analyzing the confusion matrix (Figure 42), it is evident that both labels exhibit an equal distribution throughout the matrix, indicating that the model has successfully learned both features equally.
Fig.42 Naive Bayes Confusion Matrix
The results suggest that while the model wasn't able to capture all the patterns between the 'titles' feature and the 'views/elapsedtime' feature from the dataset, it did demonstrate the existence of some patterns. With accuracy, precision, and recall hovering around 58%, the model performed better than random guessing. However, it falls short of being able to make meaningful predictions due to inadequate training on the dataset.
These outcomes align with expectations for Naive Bayes classifiers, which are basic probabilistic classifiers lacking the advanced capability to discern complex patterns present in text documents. While unsupervised learning algorithms and Naive Bayes have indicated a relationship between YouTube titles and their corresponding views, more sophisticated algorithms such as CNN or LSTM could better capture and learn intricate patterns from the dataset. These advanced models are adept at handling sequential data like text and could potentially yield more accurate predictions by extracting deeper insights from the 'titles' feature in relation to 'views/elapsedtime'.
Decision trees (refer Figure 43) are supervised machine learning algorithms suitable for both classification and regression tasks. The algorithm operates by recursively partitioning the dataset into smaller subsets based on the input features, aiming to predict the output feature. The model trains by choosing the split at each node that is best according to either the Gini Index or Information Gain.
Entropy measures the amount of disorder or uncertainty in the dataset; the formula is provided in Figure 45. It is commonly used for making decisions at nodes in decision trees, where the aim is to minimize entropy. Maximum entropy means the classes are evenly distributed, i.e., maximum disorder.
Entropy formula (Fig. 45)
Information Gain formula (Fig. 46)
Decision Tree Example (Fig. 43)
Gini Index formula (Fig. 44)
Information Gain quantifies how much information a split at a particular node captures; it measures the reduction in entropy achieved by the split. The main aim is to reduce uncertainty so that the model can learn patterns in the dataset: the higher the information gain, the better the split. The formula is given in Fig. 46. While this project uses the Gini Index as the split criterion, many applications use Information Gain and entropy to make decisions.
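For reference, the standard forms of the quantities in Figs. 44 through 46 are, for a set $S$ with class proportions $p_1, \dots, p_k$ and an attribute $A$ splitting $S$ into subsets $S_v$:

```latex
\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^{2}, \qquad
\mathrm{Entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad
\mathrm{IG}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{\lvert S_v \rvert}{\lvert S \rvert}\, \mathrm{Entropy}(S_v)
```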
The process of making predictions for testing is fairly simple: the tree is traversed from the root node by making decisions at each internal node based on the feature values of the data point and the splitting parameter (Gini Index in our case). The prediction is the value obtained at the final node, which is the leaf node.
The data preparation stage for the decision tree classifier was conducted in multiple steps:
Fig.25 Sample of Original Cleaned Data
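Before the results, a hedged sketch of one arm of the experiment, the SpaCy-embedding variant (the model name, split ratio, and variable names are assumptions; the BERT arm would swap in BERT sentence embeddings):

```python
# Minimal sketch: SpaCy title embeddings feeding a depth-limited decision tree.
import numpy as np
import spacy
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

nlp = spacy.load("en_core_web_md")               # medium model has word vectors
X = np.vstack([nlp(t).vector for t in titles])   # one 300-d vector per title
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# max_depth=5 tames the overfitting seen with the default unlimited depth
tree = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
tree.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```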
The Decision Tree Model was trained on the training dataset, and the results were obtained from the testing dataset and visualized in Figure 49. Analysis of the visualization reveals that both the accuracy and recall metrics, derived from both BERT and SpaCy embeddings, achieved scores in the vicinity of 55%. Initially trained with the default 'max_depth' parameter, the model exhibited signs of overfitting, evident from the complex tree structure observed in Figure 50. However, upon adjusting the 'max_depth' parameter to 5, the model demonstrated improved fitting to the dataset, as depicted in Figure 51 (a) and Figure 51 (b).
Decision Tree evaluation metric (Fig. 49)
Initial Decision tree (Fig. 50)
Decision Tree using BERT Embeddings (Fig. 51 (a))
Decision Tree using SpaCy Embeddings (Fig. 51 (b))
Furthermore, examination of the confusion matrix, Fig. 52 (a) and Fig. 52 (b) from both embeddings indicates an even distribution of both labels throughout the matrix. This uniform distribution suggests that the model effectively learned both features with equal proficiency.
Confusion Matrix for BERT and SpaCy Embeddings DT (Fig. 52 (a) and Fig. 52 (b))
In conclusion, the experiment revealed that the decision tree classifier struggled to effectively train on the dataset due to its complexity. Nonetheless, the findings suggest the presence of a discernible pattern between YouTube video titles and their corresponding views over time. Although setting the 'max_depth' parameter to 5 helped prevent overfitting, it did not significantly improve model accuracy on the test dataset.
The experiment, conducted using both BERT and SpaCy embeddings, aimed to observe any differences in predictions. Surprisingly, both embeddings performed similarly, indicating that the decision tree classifier may not effectively leverage the advanced linguistic characteristics captured by BERT. This suggests that the decision tree classifier may not be the optimal model for capturing the complex patterns within text documents. Moving forward, it is evident that employing more advanced models like CNN or LSTMs may yield better classification results by effectively leveraging the intricate linguistic features present in the text.
Support Vector Machine (SVM) (refer Fig. 53) is one of the most advanced supervised machine learning algorithms, known for its capability in creating decision boundaries around complex classes. SVMs excel when the classes are separable, and a clear boundary can be defined between them. The primary goal of SVM is to find the hyperplane that maximizes the margin between these classes.
Kernels are used in situations where the classes are not linearly separable (refer to Fig. 55). A kernel implicitly transforms the input data to a higher-dimensional space in the hope of finding a linear separator there. Kernels are fascinating functions because they allow SVMs to compute the decision boundary as if the data lived in that higher-dimensional space: a kernel computes the corresponding dot product efficiently without explicitly transforming the data.
The dot product is central here because SVMs only need dot products between data points in the feature space, never the explicit higher-dimensional vectors themselves. A kernel evaluates, in the original space, the dot product the data would have in the higher-dimensional space, thereby reducing the cost of the calculations and allowing efficient computation without explicit transformation.
SVM classifier on various kernels (Fig. 55)
SVM Architecture (Fig. 53)
An SVM example for linearly separable (Fig. 54)
The Polynomial Kernel function (refer Fig. 56) computes a dot product between the input vectors in the original feature space and raises the shifted sum to the specified degree d; the constant c allows for shifting the polynomial. The Radial Basis Function (RBF) Kernel (refer Fig. 56), on the other hand, measures the similarity between vectors based on the Euclidean distance. An example is presented in Fig. 56 illustrating how a polynomial kernel casts 2D points into 6D points.
Kernels and Example (Fig. 56)
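In standard notation, the two kernels shown in Fig. 56 are usually written as:

```latex
K_{\mathrm{poly}}(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + c)^{d}, \qquad
K_{\mathrm{RBF}}(\mathbf{x}, \mathbf{y}) = \exp\!\bigl(-\gamma \,\lVert \mathbf{x} - \mathbf{y} \rVert^{2}\bigr)
```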
The data preparation stage began with collecting thumbnail URLs using a YouTube API request (refer to Fig. 57), followed by downloading the thumbnails with a Google Drive API request (refer to Fig. 58). Subsequently, the thumbnails were pre-processed using the VGG16 neural network to extract the most important features from the images. The output of the VGG16 network was stored in a NumPy file to be used as input for the SVM model. The model is then trained to predict whether a thumbnail belongs to a video that performed well or performed poorly. The link to the dataset can be found here. The link to the sample thumbnails extracted can be found here.
Fig.57 Sample of Original Cleaned Data
Fig.58 Sample of thumbnails extracted
The code for the Support Vector Machine classifier, implemented in Python, can be found here: SVM.
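A minimal sketch of this pipeline (file paths, variable names, and the train/test split are assumptions; the RBF kernel with C = 1 matches the best configuration reported below):

```python
# Minimal sketch: VGG16 feature extraction on thumbnails, then an SVM,
# assuming a list `thumbnail_paths` and a matching list of `labels`.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.utils import load_img, img_to_array
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# VGG16 without its classification head; global average pooling yields one
# 512-d feature vector per thumbnail
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(path: str) -> np.ndarray:
    img = img_to_array(load_img(path, target_size=(224, 224)))
    return extractor.predict(preprocess_input(img[None, ...]), verbose=0)[0]

X = np.vstack([embed(p) for p in thumbnail_paths])
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

# RBF kernel with C=1 was the best-performing configuration reported below
svm = SVC(kernel="rbf", C=1).fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))
```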
The output of VGG16 was used as input for various SVM models fine-tuned with different kernels and C values. These models were evaluated based on precision, accuracy, recall, and F1-score. The results obtained were compiled into a table, as shown in Table 1. Confusion matrices for the SVM models can be found in Figures 59, 60, 61.
It was observed that the RBF kernel with a C value of 1 yielded the best result, achieving an accuracy of 60%, precision of 62%, recall of 60%, and F1-score of 61%. These results are depicted in Figure 62. The linear kernel performed next best, consistently achieving similar results across different C values, with an accuracy of 56%, precision of 58%, recall of 57%, and F1-score of 57.5%. The polynomial kernel performed the least favorably, with the highest scores achieved at a C value of 10, resulting in an accuracy of 59%, precision of 63%, recall of 48%, and F1-score of 54%. Despite exhibiting higher precision and accuracy, the polynomial kernel tended to underfit the data.
SVM evaluation metric (Fig. 62)
SVM classifier on various kernels (Table. 1)
In conclusion, the evaluation of SVM models with various kernels and C values using VGG16-extracted features has provided insights into their performance on image classification tasks. Among the experiments conducted, the RBF kernel with a C value of 1 emerged as the winner, exhibiting strong overall performance with balanced accuracy, precision, recall, and F1-score metrics. The consistent performance of the linear kernel across different C values highlights its reliability for this task. However, the polynomial kernel, particularly at higher C values, showed signs of potential overfitting or underfitting, underscoring the importance of careful parameter selection.
To explore the captivating world of YouTube titles and thumbnails and their impact on views, this study delves into how these elements influence video performance over time. By analyzing these factors, valuable insights are uncovered on how to improve channel performance. Firstly, the research reveals patterns regarding the impact of title length on video performance, noting that longer titles (refer Fig. 62) tend to attract higher viewership compared to shorter ones (refer Fig. 61). This suggests that while attention-grabbing titles are crucial, providing a sufficiently detailed title is also essential to attract viewers. Additionally, there is a direct correlation between title length and video performance metrics such as likes, comments, and subscribers gained.
Moreover, the study highlights the significance of visual appeal through thumbnail design, showing that careful thumbnail design is vital for enhancing video performance. Specific colors, images, or styles may be more effective in capturing viewer interest and encouraging clicks on the video (refer Fig. 63,64). The interplay between titles and thumbnails plays a crucial role in attracting views, with effective thumbnails that complement the title contributing to higher view counts. Certain combinations of thumbnails and titles may perform exceptionally well by increasing viewership and setting clear expectations about the video content. Therefore, crafting a compelling title and thumbnail is considered the initial step towards a successful video.
Fig. 63 Attractive Thumbnail
Fig. 64 Attractive Thumbnail Example 2
Fig. 61 Short Title
Fig. 62 Long Title
The relationship between the recommendation system and optimizing titles and thumbnails is another intriguing aspect. As titles and thumbnails directly impact the number of views a video gains, they likely influence the recommendation system, which heavily relies on video click-through rates. Understanding peak viewer activity times can help optimize content release schedules for maximum visibility and engagement. Additionally, demographic factors play a role in viewer preferences for thumbnails and titles. Different audience segments may respond differently to specific colors, images, or content styles, highlighting the importance of audience segmentation and targeting in content creation.
Future studies could focus on developing advanced systems to learn the combinations of titles and thumbnails that attract views. By employing such systems, YouTube content creators can benefit from crafting appropriate titles and thumbnails for their videos. Analyzing the impact of click-through rates on video performance, rather than absolute view counts, can provide insights into the recommendation system's behavior. Furthermore, assessing the relationship between views and video duration can improve video content and overall performance. YouTube holds enormous potential for analysis, with numbers flowing in every second from videos around the world. Future research that carefully examines the nature of titles and thumbnails can extend this analysis beyond YouTube toward a general understanding of human behavior, by learning the patterns that turn a YouTube recommendation into a view.
In conclusion, the findings underscore the importance of both the art and science behind video content creation on YouTube. Effective thumbnail design, aptly summarized by the saying 'A picture is worth a thousand words', along with compelling titles and a profound understanding of viewer behavior, serves as the cornerstone for enhancing video performance and fostering audience engagement. This holistic approach to content optimization opens up new opportunities for creators to connect with their audiences and thrive in the dynamic world of digital content. Furthermore, the insights gained from this analysis are not limited to content creators alone; they also hold significant potential for advertising and marketing strategies. By identifying patterns in the visual elements that attract attention, advertisers and marketing professionals can leverage this knowledge to attract more customers to their products or services. UX designers can likewise apply these findings by considering factors such as title and thumbnail design, getting more creative in building engaging user interfaces that attract human attention.